ABSTRACT


Accurately predicting which students are best suited for 
graduate programs is beneficial to both students and col- 
leges. In this paper, we propose a quantitative machine 
learning approach to predict an applicant’s potential perfor- 
mance in the graduate program. Our work is based on a 
real world dataset consisting of MS in CS students in the 
College of Computer and Information Science program at 
Northeastern University. We address two challenges associ- 
ated with our task: subjectivity in the data due to change of 
admission committee membership from year to year and the 
shortage of training data. Our experimental results demon- 
strate an effective predictive model that could serve as a 
Focus of Attention (FOA) tool for an admission committee. 
1. INTRODUCTION


Master’s education is the fastest growing and largest compo- 
nent of the graduate enterprise in the United States. Accord- 
ing to the 2016 joint survey conducted by the CGS (Council 
of Graduate Schools) and ETS (Educational Testing Ser- 
vice) [4], first-time enrollment in U.S. graduate programs 
reached a record high total of 506,927 students in Fall 2015. 
Because of the rise in applicants, the admissions process 
may become increasingly tedious and challenging. The ETS 
has established standardized tests (such as the GRE) to help 
evaluate applicants’ quantitative, reading, and writing skills, 
but these scores alone are far from indicative of successful
students. Although applicants’ previous achievements
can demonstrate excellence, students with high GPAs from 
prestigious universities do not always excel in their graduate 
studies. 


In this paper, we take a quantitative machine learning ap- 
proach to predict the outlook of applicants’ graduate studies 
based on features extracted from their application materials. 
The training data for our model are previously admitted
students with their performance measures in the graduate
program. In particular, we have a real world dataset from
Northeastern University’s MS in Computer Science (MSCS) 
program, consisting of MS students from 2009 to 2012. We 
use a student’s overall GPA in the MSCS program as his/her 
performance measure. Our model aims to identify the top 
20% and bottom 20% performing students respectively (see 
details in Section 4.1). 


Two challenges arise when learning with this data. First, 
the data involves the admission committee’s (possibly sub- 
jective) evaluation. Specifically, some members of the com- 
mittee may be biased in weighing a particular set of stan- 
dards (e.g., GRE scores), while others may be in favor of 
different measures. This issue is particularly acute when 
the admission committee/policy changes from year to year. 
As a result, it can be difficult to form an accurate predictor 
directly from the entire dataset. The second challenge is the
limited amount of training data. We have a total of 454 labeled
training samples (all admitted students) from 2009 to
2012. On the other hand, we have over 2000 applications
that are either rejected (i.e., not admitted) or declined (i.e.,
admitted but not enrolled), which can serve as an unlabeled
auxiliary dataset. Our conjecture is that building a semi- 
supervised model leveraging the large set of unlabeled data 
may lead to a superior performance compared to using the 
labeled data alone. 


Our model is inspired by two existing frameworks: SVM+ 
[12] and S3VM [3]. SVM+ is a variant of SVM which addresses
the issue of heterogeneous data. Specifically, SVM+
implicitly establishes a different hyper-plane for each data 
subgroup by modifying a standard SVM’s objective func- 
tion and constraints. S3VM is a semi-supervised version of 
SVM which learns a classifier using both labeled and unlabeled
data. Our contribution is a new variant of SVM that
unifies the advantages of both S3VM and SVM+. Our new
model, which we name S3VM+, addresses the admission biases
in the labeled data and utilizes unlabeled applicants’
data simultaneously. S3VM+ can be applied to any domain
for which the data may have clearly defined subgroups (e.g., 
privileged domain knowledge) and a large amount of unla- 
beled data. 


An additional motivation of our research was to test
whether we could predict student success
based only on quantitative measures and, thus, remove the
subjectivity of the committee reading the recommendation 
letters and statement of purpose. If successful, such a model 
will not only lead to a better selected student body, but 
also help to manage growing enrollments. Our experimental 
results (see Section 4 for details) demonstrate that, with 
our new model, we can achieve an effective yet imperfect 
prediction. Thus, in practice, our model could serve as a 
Focus of Attention (FOA) tool for the admission committees. 


The rest of the paper is organized as follows: in Section 2, 
we present the related work in predicting students’ perfor- 
mance in the education domain. In Section 3, we give brief 
introductions to S3VM and SVM+ and present our model 
S3VM+ in detail. We demonstrate the efficacy of our model
in Section 4 by comparing its performance to three
existing models. Finally, we conclude in Section 5.


2. RELATED WORK 


Most educational data mining (EDM) studies focus on predicting students’ academic
performance after they have been admitted to the college or 
program. For example, Lepp et al. investigated the relation- 
ship between cell phone use and academic performance in a 
sample of US college students [8]. Delen applied machine 
learning techniques for student retention management [6]. 
Joanna et al. presented a dropout prediction in e-learning 
courses using machine learning techniques [10]. Neverthe- 
less, another important aspect of educational research is se- 
lecting the best fitting students at admission time, which 
has not been widely addressed in past literature. 


The most closely related work to our paper is the admissions 
research conducted by the University of Texas at Austin (UT 
Austin) for their graduate admission program [14], driven 
in part by their need to manage growing application num- 
bers. In their work, the authors applied logistic regression 
(LR) to help the admission committee identify weak candi- 
dates who will likely be rejected and exceptionally strong 
candidates who will likely be admitted. Our work bears 
a similar mission but is different in three aspects. First, 
the UT Austin research includes credentials such as recom- 
mendation letters and statement of purpose, whereas our 
work strives to build a purely quantitative model relying 
only on non-subjective measures. Second, the recommen- 
dations made by UT Austin’s algorithm are based on an 
applicant’s likelihood of admission, whereas our model aims 
to predict the future performance of the applicants in the 
graduate program. Last, our model addresses human sub- 
jectivity in admission decisions. The contribution of our 
paper is a quantitative machine learning model to predict a 
candidate’s future performance at admission time. 


3. INTEGRATING SEMI-SUPERVISED SVM 
WITH DOMAIN KNOWLEDGE 


We choose our model based on the characteristics of our 
dataset and particular challenges involved in our task. In 
particular, we choose SVM and two existing frameworks,
S3VM [3] and SVM+ [12], as our baseline models. Our
proposed model is a new variant of SVM, which is inspired
by S3VM and SVM+. We first give brief introductions to
S3VM and SVM+. We then describe our new model in 
detail in Section 3.3. 


3.1 S3VM (Semi-Supervised SVM) 

S3VM is a semi-supervised SVM proposed in [3]. The model
is learned using a mixture of labeled data (the training set) 
and unlabeled data (the auxiliary set). The objective is to 
assign class labels to the auxiliary set such that the “best” 
support vector machine (SVM) is constructed. In particular, 
given a labeled dataset $L = \{x_1, x_2, \ldots, x_l\}$ and an unlabeled
auxiliary dataset $U = \{x_{l+1}, x_{l+2}, \ldots, x_{l+n}\}$, S3VM
learns a classifier from both L and U using overall risk 
minimization (ORM) posed by Vapnik [13] (Chapter 10). 
Starting with the standard SVM formulation, S3VM adds 
two constraints for each data point in the auxiliary set U. 
One constraint calculates the misclassification error as if the 
point were placed in class 1, and the other constraint calcu- 
lates the misclassification error as if the point were placed 
in class -1. The objective function calculates the minimum 
of the two possible misclassification errors. The final mem- 
bership assignments of the instances in U correspond to the 
ones that result in a minimum total sum of slacks across all 
instances in the training set. Specifically, we have: 
where C is the trade-off between maximizing the margin
and total violations. The $\eta_i$’s are the slacks for the labeled data,
and the $\xi_j$’s and $z_j$’s are the slacks for the unlabeled data
hypothetically assigned to the positive and negative classes
respectively.
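
To make the roles of the three kinds of slack concrete, the following minimal Python sketch evaluates the objective described above for a fixed linear classifier. It is an illustration only, not the solver used in our experiments: the function name `s3vm_objective` is hypothetical, and the regularizer $\frac{1}{2}\|w\|^2$ is an assumption (the standard soft-margin form).

```python
def s3vm_objective(w, b, labeled, unlabeled, C):
    """Evaluate 0.5*||w||^2 + C*(sum of eta_i + sum of min(xi_j, z_j)).

    labeled   -- list of (x, y) pairs with y in {+1, -1}
    unlabeled -- list of feature vectors x
    The 0.5*||w||^2 regularizer is an assumption of this sketch.
    """
    def score(x):
        return sum(wk * xk for wk, xk in zip(w, x)) + b

    def hinge(margin):
        # Smallest slack s >= 0 that satisfies margin + s >= 1.
        return max(0.0, 1.0 - margin)

    # eta_i: slack of each labeled point under its known label.
    eta_total = sum(hinge(y * score(x)) for x, y in labeled)

    # xi_j: slack if the unlabeled point were placed in class 1;
    # z_j: slack if it were placed in class -1. Keep the smaller one.
    unlabeled_total = sum(
        min(hinge(score(x)), hinge(-score(x))) for x in unlabeled
    )

    regularizer = 0.5 * sum(wk * wk for wk in w)
    return regularizer + C * (eta_total + unlabeled_total)
```

Each unlabeled point contributes the smaller of its two hypothetical slacks, so a point lying outside the margin on either side of the hyperplane adds nothing to the objective.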


Equation (1) can be solved using mixed integer programming
by applying the “large integer M” technique. The idea
is to introduce a constant integer $M > 0$ and a decision variable
$d_j \in \{0,1\}$ for each point $x_j$ in the auxiliary set U, where $d_j$
indicates the class membership of $x_j$. If $d_j = 1$, then the point
is in class 1, and if $d_j = 0$, then the point is in class -1. The
integer M is chosen sufficiently large such that if $d_j = 0$ then
$\xi_j = 0$ is feasible for any optimal w and b. Likewise, if $d_j = 1$,
then $z_j = 0$ is feasible. In other words, at most one of $\xi_j$ and $z_j$
is non-zero no matter what class $x_j$ belongs to. Consequently,
we can replace the $\min(\xi_j, z_j)$ in Equation (1) by
$(\xi_j + z_j)$. This results in the following formulation:
The solution to Equation (2) can be found using mixed integer
programming solvers. In our experiment, we used the
CVX [1] and Gurobi [2] optimizers. As with a standard
SVM, S3VM classifies a new instance $x^*$ using $\mathrm{sign}(w \cdot x^* + b)$.
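
The “large integer M” construction described in this section can be written out as the following mixed integer program. This is a sketch in the notation of this section and may differ in detail from Equation (2). It assumes labels $y_i \in \{1, -1\}$ for the labeled points and the standard regularizer $\frac{1}{2}\|w\|^2$. Here $\eta_i$ denotes the slack of labeled point $x_i$, and $\xi_j$ and $z_j$ denote the slacks of unlabeled point $x_j$ when it is placed in class 1 and class -1 respectively.

```latex
\begin{aligned}
\min_{w,\,b,\,\eta,\,\xi,\,z,\,d}\quad
  & \tfrac{1}{2}\|w\|^2
    + C\Big[\sum_{i=1}^{l}\eta_i + \sum_{j=l+1}^{l+n}(\xi_j + z_j)\Big] \\
\text{s.t.}\quad
  & y_i\,(w\cdot x_i + b) + \eta_i \ge 1, \quad \eta_i \ge 0,
    && i = 1,\ldots,l \\
  & (w\cdot x_j + b) + \xi_j + M(1-d_j) \ge 1, \quad \xi_j \ge 0,
    && j = l+1,\ldots,l+n \\
  & -(w\cdot x_j + b) + z_j + M d_j \ge 1, \quad z_j \ge 0,
    && j = l+1,\ldots,l+n \\
  & d_j \in \{0,1\},
    && j = l+1,\ldots,l+n
\end{aligned}
```

Setting $d_j = 1$ relaxes the class -1 constraint by M, so $z_j = 0$ becomes feasible; setting $d_j = 0$ relaxes the class 1 constraint, so $\xi_j = 0$ becomes feasible.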


3.2 SVM+ 

Vapnik and Vashist [12] introduced SVM+, which is a vari- 
ant of SVM that addresses the issue of learning with het- 
erogeneous data. In their model, the authors developed a 
new paradigm to learn using privileged information (LUPI). 
The objective of SVM+ is to take advantage of additional
domain knowledge, and in particular data subgroups that 
may arise from different sources or due to labeling biases. 


Suppose the training data has t > 1 groups. We follow the 
notation in [9] and denote the indices of group r by 


$T_r = \{i_{r1}, \ldots, i_{rn_r}\}, \quad r = 1, \ldots, t.$

All training samples can then be represented as:


$\{\{X_r, Y_r\},\; r = 1, \ldots, t\}$

where $\{X_r, Y_r\} = \{(x_{r1}, y_{r1}), \ldots, (x_{rn_r}, y_{rn_r})\}$. To incorporate
the group information, SVM+ defines the slacks inside
each group by a unique correcting function:


$\xi_i = \xi_r(x_i) = \phi_r(x_i, w_r), \quad i \in T_r,\; r = 1, \ldots, t.$


Specifically, the correcting functions are defined as:


$\xi_r(x_i) = w_r \cdot x_i + d_r, \quad i \in T_r,\; r = 1, \ldots, t.$

Compared to a standard SVM, SVM+ uses slack variables
that are restricted by the correcting functions, and the correcting
functions capture additional information about the
data. Note that all of the data is used to construct the decision
hyperplane. The group information is only used to fine-tune
the slack variables. Formally, the objective function for
SVM+ is formulated as follows:

where $\gamma$ is the trade-off parameter between $\|w\|^2$ and
the $\|w_r\|^2$’s, and C is the trade-off between maximizing the
margin and total violations.


Liang and Cherkassky [9] further extended the SVM+ ap- 
proach to multi-task learning. In the SVM+MTL [9] frame- 
work, the data is partitioned into groups using privileged 
information similar to the SVM+ model. However, instead 
of making a correcting function for the slack variables, their 
model establishes a unique correcting function (i.e., a hyper- 
plane) for each group in addition to a shared common hyper- 
plane. In other words, the decision function for group r = 
1,...,t is as follows: 


$f_r(x) = (w \cdot x + b) + (w_r \cdot x + d_r)$


where w, b are the parameters for the common hyper-plane
and $w_r$, $d_r$ are the parameters for the correcting function for
group r. The corresponding formulation of the quadratic 
optimization problem is as follows: 


SVM+MTL is an adaptation of SVM+ for solving MTL
problems. In our experiment, we applied the SVM+MTL 
framework because it provides more flexibility to learn a 
different decision plane for each year’s student data. 
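
For reference, the SVM+MTL quadratic program can be sketched as follows in the notation of this section. The sketch assumes the usual soft-margin form with slack variables $\xi_i$, the trade-off parameter $\gamma$ between $\|w\|^2$ and the $\|w_r\|^2$’s, and the trade-off constant C; the exact presentation in [9] may differ.

```latex
\begin{aligned}
\min_{w,\,b,\,\{w_r,\,d_r\},\,\xi}\quad
  & \tfrac{1}{2}\|w\|^2 + \tfrac{\gamma}{2}\sum_{r=1}^{t}\|w_r\|^2
    + C\sum_{r=1}^{t}\sum_{i\in T_r}\xi_i \\
\text{s.t.}\quad
  & y_i\big[(w\cdot x_i + b) + (w_r\cdot x_i + d_r)\big] \ge 1 - \xi_i, \\
  & \xi_i \ge 0, \qquad i \in T_r,\; r = 1,\ldots,t
\end{aligned}
```

A large $\gamma$ penalizes the group-specific vectors $w_r$ and pulls all groups toward the shared hyper-plane, while a small $\gamma$ lets each group’s decision function deviate more freely.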


For SVM+MTL, predicting the class label for a new given
instance $x^*$ is not straightforward because its decision function
requires a group-dependent correcting function, and we
do not know the group membership of test instances. To resolve
this problem, we predict the label for $x^*$ in each group
and perform a majority vote over all predicted labels. Specifically,
a test instance $x^*$ will be predicted in each group as
follows:


$f_r(x^*) = \mathrm{sign}[(w \cdot x^* + b) + (w_r \cdot x^* + d_r)]$


where r = 1,...,t are the bias groups, and w, b, the $w_r$’s and
$d_r$’s are learned model parameters. The class membership
for $x^*$ is determined by a majority vote over the $f_r(x^*)$’s.
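
The voting rule above can be sketched in a few lines of Python. This is a minimal illustration rather than our experimental code: the function names are hypothetical, and both the treatment of a zero score as class 1 and the tie-break in favor of class 1 are assumptions that the text does not specify.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def predict_by_group_vote(x, w, b, group_params):
    """Predict +1 or -1 for x by majority vote over the group decision functions.

    w, b         -- parameters of the shared hyper-plane
    group_params -- list of (w_r, d_r) pairs, one per group r
    """
    votes = 0
    for w_r, d_r in group_params:
        # Group-dependent decision value: shared part plus correcting function.
        score = (dot(w, x) + b) + (dot(w_r, x) + d_r)
        votes += 1 if score >= 0 else -1  # assumption: a zero score counts as +1
    # Assumption: ties (possible with an even number of groups) go to class 1.
    return 1 if votes >= 0 else -1
```

With three yearly groups, as in our setting, a tie cannot occur, so the tie-break only matters for an even number of groups.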


3.3. S3VM+ 

Our new model, S3VM+, leverages the unlabeled data and
addresses the biases in the training data simultaneously. In
particular, we train our model with a labeled dataset and an
unlabeled auxiliary dataset. Furthermore, our data is partitioned
into yearly groups because the admissions committee
changes from year to year and thus each year’s decisions may
carry different biases. For the labeled dataset, we incorporate the grouping
information by establishing a correcting function for each 
group (constraints (a) and (b) in Equation (5)). 


For the unlabeled data, we introduce two slack variables $\xi_j$
and $z_j$ for each data point $x_j$, representing the slacks of placing
$x_j$ in the positive and negative classes respectively.
The objective function for S3VM+ takes the minimum of the
two slacks for each unlabeled instance and minimizes the total
sum of slacks across all training instances. We apply the
“large integer M” technique (see Section 3.1 for details) and
convert the constraint with a minimization function to two
constraints over linear functions. Because both labeled and
unlabeled data are grouped by academic year, we apply the
same correcting functions used for the labeled data to each
corresponding annual group of unlabeled data (constraints
(c) to (f) in Equation (5)). Formally, the optimization problem
for S3VM+ is formulated as follows:


$\min \; \frac{1}{2}\|w\|^2 + \frac{\gamma}{2}\sum_{r=1}^{t}\|w_r\|^2 + \cdots$
where C is the trade-off between maximizing the margin and
total violations, and $\gamma$ is the trade-off parameter between
$\|w\|^2$ and the $\|w_r\|^2$’s. Note that constraints (a) and (b)
are for labeled instances and constraints (c) to (f) are for
unlabeled instances.
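
Combining the group-dependent decision function of Section 3.2 with the “large integer M” constraints of Section 3.1 gives a problem of the following shape. This is a sketch under one reading of the description above, not a verbatim copy of Equation (5). The index sets $U_r$ (unlabeled instances of year r), the slack symbols $\eta_i$ for labeled instances, and the binary variables $\delta_j$ (written $\delta$ to avoid a clash with the offsets $d_r$) are notation introduced here for illustration, and the constraint labels (a) to (f) are omitted.

```latex
\begin{aligned}
\min\quad
  & \tfrac{1}{2}\|w\|^2 + \tfrac{\gamma}{2}\sum_{r=1}^{t}\|w_r\|^2
    + C\Big[\sum_{r=1}^{t}\sum_{i\in T_r}\eta_i
    + \sum_{r=1}^{t}\sum_{j\in U_r}(\xi_j + z_j)\Big] \\
\text{s.t.}\quad
  & y_i\,f_r(x_i) \ge 1 - \eta_i, \quad \eta_i \ge 0,
    && i \in T_r \\
  & f_r(x_j) + \xi_j + M(1-\delta_j) \ge 1, \quad \xi_j \ge 0,
    && j \in U_r \\
  & -f_r(x_j) + z_j + M\delta_j \ge 1, \quad z_j \ge 0,
    && j \in U_r \\
  & \delta_j \in \{0,1\},
    && j \in U_r \\
  & f_r(x) = (w\cdot x + b) + (w_r\cdot x + d_r),
    && r = 1,\ldots,t
\end{aligned}
```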


To classify a new instance x*, we follow the same approach 
as SVM+, which is to take a majority vote on class labels 
predicted by each group. 


4. EXPERIMENTAL RESULTS 


In this section, we first describe the process of constructing 
our training and testing dataset. We then discuss the meth- 
ods we used to conduct our experiments in Section 4.2. We 
present our analysis of our experiments in Section 4.3. 


4.1 Constructing the Training and Test Data 
We have a real world dataset consisting of students from the 
MSCS program at Northeastern University. Table 1 presents 
the features we collected from students’ applications for our 
experiment. Feature 1 contains the students’ undergradu- 
ate GPAs adjusted according to each individual university’s 
grading scale. For example, a 3.5 out of 5 and a 7 out of 
10 would result in the same value. Feature 10 contains self-reported
values representing the maximum number of lines
of code written by the student prior to joining the
MS program. Feature 12 contains the rankings of the un- 
dergraduate institutions where the students obtained their 
bachelor’s degrees. We classified the rankings into 4 cate- 
gories with 1 being the most prestigious. The classification 
was performed manually according to the Best Global Uni- 
versities list published by US News and World Report. The 
rest of the features are standardized test scores. Both the
GRE and TOEFL had two versions during 2009 - 2012 that
used different scoring scales. Scores from the older versions
are converted to the newer scoring scales using
conversion tables provided by the ETS [4].
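
As a small illustration of the GPA adjustment for Feature 1, the sketch below divides each GPA by the maximum of its grading scale. Simple linear scaling is an assumption made here for illustration (the function name `normalize_gpa` is hypothetical); it reproduces the example in which 3.5 out of 5 and 7 out of 10 yield the same value.

```python
def normalize_gpa(gpa, scale_max):
    """Map a GPA onto [0, 1] by dividing by the top of its grading scale.

    Linear scaling is an assumption of this sketch; a real conversion may
    need per-university rules.
    """
    if scale_max <= 0:
        raise ValueError("scale_max must be positive")
    return gpa / scale_max


# Both examples from the text land on the same normalized value.
same_value = abs(normalize_gpa(3.5, 5) - normalize_gpa(7, 10)) < 1e-12
```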


Table 1: Features Collected for Training

 1   Undergraduate GPA
 2   GRE Verbal
 3   GRE Quantitative
 4   GRE Analytical Writing
 5   TOEFL Total
 6   TOEFL Reading
 7   TOEFL Listening
 8   TOEFL Speaking
 9   TOEFL Writing
10   Max # of Lines of Code Written
11   Bachelor’s Degree in EECS (Yes/No)
12   Undergraduate School Ranking


Table 2: Student Data Statistics

Year   Total   Top 20%   Bottom 20%   Aux. Data
2009   37      7         ie           431
2010   89      18        17           503
2011   132     28        27           705
2012   196     51        42           948


As mentioned in Section 1, our task is to identify successful
candidates at the point of admission. One measure of
success is MS-GPA in the MS program (as distinct from the
input feature 1 “Undergraduate GPA”). Indeed, a cumulative
MS-GPA is the most widely used measure for students’
academic performance [11]. The labels in our training data
are determined by the training instances’ percentiles in the
overall MS-GPAs. Specifically, the top and bottom 20% of
students are labeled with class 1 and -1 respectively. The
20% threshold was chosen intuitively as a measure that sets
these individuals apart from the average students.
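
The relative labeling rule can be sketched as follows. This is an illustrative reading, not our exact procedure: the function name `label_top_bottom` is hypothetical, and the handling of ties (every student whose MS-GPA equals a cutoff value receives that label) is an assumption. Under that assumption the two classes can differ in size when several students share a cutoff GPA.

```python
def label_top_bottom(gpas, fraction=0.2):
    """Label each GPA as 1 (top tail), -1 (bottom tail) or 0 (middle, unused).

    Assumption: ties at a cutoff value are all included in that tail, so the
    two tails may contain different numbers of students.
    """
    n = len(gpas)
    k = max(1, round(fraction * n))   # target size of each tail
    ranked = sorted(gpas)
    low_cut = ranked[k - 1]           # k-th lowest MS-GPA
    high_cut = ranked[n - k]          # k-th highest MS-GPA

    labels = []
    for g in gpas:
        if g >= high_cut:
            labels.append(1)
        elif g <= low_cut:
            labels.append(-1)
        else:
            labels.append(0)
    return labels
```

For ten students with two of them tied at the second-highest GPA, this rule labels three students as class 1 and two as class -1.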


Note that we did not use a midpoint MS-GPA as a cutoff 
to separate the positive and negative classes, in order to re- 
duce the label noise. In particular, instances close to the 
average GPA are harder to categorize as good or bad stu- 
dents. Another intuitive approach is to define two hard 
MS-GPA thresholds for good versus bad performances, i.e., 
to have a MS-GPA above an upper threshold (e.g., > 3.8) 
for good students, and below a bottom threshold (e.g., < 
3) for bad students. Further investigation reveals that
this approach is less effective for the following reason: dif- 
ferent instructors have different grading policies due to the 
nature of the courses. For some fundamental courses, an ’A’ 
means you are in the top 30% of a class, while for some other 
advanced courses, an ’A’ means you are in the top 10% of 
a class. Even for the same course in the same year with
different sections, the instructors may or may not coordinate
exams and grading. Because students have different instructors
and may even take different courses, hard cutoffs are
not an accurate reflection of a student’s abilities. 


Having stated this, on the other hand, if a student performs 
consistently in the top 20% in each class, this student will 
be among the top 20th percentile of the entire MS-GPA 
spectrum. The same can be said for those that perform 
consistently in the bottom 20th percentile. Identifying the
factors that lead to this consistent success or underperformance
is of greatest interest to this research. Therefore,
we used relative measures to label our positive and nega- 
tive training samples. For comparison purposes, we report 
our experimental results on both relative and hard cutoffs 
in Tables 5 and 6 respectively. 


We did experiment with splitting the two classes using the
mean value of all MS-GPAs; as expected, the performance
was not satisfactory.
Table 3: Prediction Using One Year of Data

Train   Test   Top 20% MS-GPA   Bottom 20% MS-GPA   Overall
2009    2010   0.72             0.59                0.66
2010    2011   0.64             0.70                0.67
2011    2012   0.65             0.76                0.70


Table 4: Predicting Using 10-fold Cross Validation

              |     Test Accuracy       |   Training Accuracy
Data          | Top20   Bot20   Overall | Top20   Bot20   Overall
2009 - 2011   | 0.70    0.71    0.71    | 0.79    0.79    0.79
2009 - 2012   | 0.74    0.72    0.73    | 0.84    0.75    0.79


Table 2 summarizes the distribution of students from 2009 
to 2012. Column “Total” is the total number of students 
enrolled in the corresponding year. Columns “Top 20% MS- 
GPA” and “Bottom 20% MS-GPA” are the total number 
of students in the top and bottom 20th percentile among 
their peers measured by the cumulative MS-GPAs. There
is not an equal number of positive and negative instances
for each year because multiple students can share the same
MS-GPA.


Both S3VM and our model S3VM+ make use of an unlabeled
auxiliary dataset. We collect the application data of
rejected (i.e., not admitted) and declined (i.e., admitted but 
not enrolled) applicants as the auxiliary data. These data 
contain the same features as the labeled data, and the size 
distribution of auxiliary data from 2009 to 2012 is presented 
in the last column of Table 2. Our training data are all la- 
beled and unlabeled instances from 2009 to 2011, and our 
test data are labeled instances from 2012. 


4.2 Experimental Method 

We are interested in identifying the top and bottom 20% 
of candidates from an application pool based on the perfor- 
mance of the admitted students. Our first goal is to confirm 
our conjecture that there are biases in admission decisions 
from year to year. To this end, we conducted two exper- 
iments. The first experiment is to use the previous year’s 
data to predict the current year’s performance using a stan- 
dard SVM. For example, we would use class 2009’s data to 
predict class 2010’s performance, and class 2010’s data to 
predict class 2011’s performance. Table 3 presents the prediction
accuracies for each year. We observe that, for 2010,
the top 20% of students are easier to predict than the bottom
20%, whereas for 2011 and 2012, the situation is reversed.
This lack of consistency and the low overall accuracies (up
to 70%) suggest that there is no strong correlation of predictive
patterns from year to year. Our second experiment is
to apply a standard 10-fold cross validation on two datasets: 
all data from 2009 to 2011 and all data from 2009 to 2012. 
Because 2012 added a significant amount (89%) of instances, 
we would expect a noticeable increase in both the training 
and test accuracies if the data across different years conform 
to the same distribution. Table 4 summarizes the results of 
this experiment. We observe only a marginal improvement 
in overall test accuracy after adding instances from 2012 
and, more importantly, the overall fit of the data remains 
the same (79%). From these two experiments, we conclude
that data across different academic years have different distributions.
We believe this year to year bias is due to the
change in the membership of the admission committee.


Table 5: Performance Comparison with Relative Cutoffs

          |     Test Accuracy       |   Training Accuracy
Model     | Top20   Bot20   Overall | Top20   Bot20   Overall
SVM       | 0.73    0.71    0.72    | 0.79    0.80    0.79
S3VM      | 0.75    0.74    0.74    | 0.81    0.82    0.81
SVM+      | 0.77    0.70    0.74    | 0.92    0.84    0.88
S3VM+     | 0.82    0.72    0.77    | 0.95    0.89    0.92


Table 6: Performance Comparison with Hard Cutoffs

          |     Test Accuracy       |   Training Accuracy
MS-GPA    | >3.8    <3.4    Overall | >3.8    <3.4    Overall
SVM       | 0.65    0.69    0.66    | 0.73    0.75    0.74
S3VM      | 0.72    0.65    0.70    | 0.83    0.70    0.77
SVM+      | 0.75    0.64    0.71    | 0.92    0.75    0.84
S3VM+     | 0.77    0.67    0.74    | 0.93    0.80    0.87


In light of these findings, we partitioned the
data by academic year and used the partitions as the privileged groups
in SVM+ and S3VM+. We take the union of labeled data
from 2009 to 2011 as our labeled training data. The auxiliary
dataset is formed as the union of the corresponding
auxiliary data from 2009 to 2011. We test and compare
the performance of the four models (SVM, S3VM, SVM+,
S3VM+) in predicting labeled instances in 2012.


The hyper-parameters are the trade-off constant C for all
four models and $\gamma$ for SVM+ and S3VM+. We perform
10-fold cross validation and grid search on the training data
to select the hyper-parameters. We first use a coarse grid
{0.01, 10, 1000} for C and refine the candidates after the
initial search. The final list for C is {1, 10, 100}. Following a
similar procedure, our final search list for $\gamma$ is {0.01, 1, 100}.
After the best hyper-parameters are selected, we train the 
corresponding model one more time using the entire training 
data and then apply the learned model to the test data and 
measure its performance. We report both training and test 
accuracies in Table 5. 
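
The selection procedure above can be sketched as a plain grid search over the final candidate lists with k-fold cross validation. This is a minimal sketch, not our experimental code: the names `k_fold_indices` and `grid_search` and the `evaluate` callback are hypothetical, and the folds here are contiguous and unshuffled for simplicity.

```python
from itertools import product


def k_fold_indices(n, k=10):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for f in range(k):
        size = n // k + (1 if f < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def grid_search(n_samples, evaluate, c_grid=(1, 10, 100),
                gamma_grid=(0.01, 1, 100), k=10):
    """Return the (C, gamma) pair with the best mean validation accuracy.

    `evaluate(train_idx, val_idx, C, gamma)` is a hypothetical callback that
    trains a model on train_idx and returns its accuracy on val_idx.
    """
    folds = k_fold_indices(n_samples, k)
    best_params, best_score = None, float("-inf")
    for C, gamma in product(c_grid, gamma_grid):
        scores = []
        for val_idx in folds:
            held_out = set(val_idx)
            train_idx = [i for i in range(n_samples) if i not in held_out]
            scores.append(evaluate(train_idx, val_idx, C, gamma))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = (C, gamma), mean_score
    return best_params, best_score
```

For SVM and S3VM, which have no $\gamma$, the same loop applies with a single placeholder value in `gamma_grid`. After selection, the model is retrained once on the full training set with the chosen pair.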


4.3 Analysis on Performance Measures 

Table 5 displays the main results of our experiment. First, 
we observe that the test accuracies for SVM on the pos- 
itive and negative classes are more balanced compared to 
the results in Table 3. There is also an improvement in 
the overall performance for SVM. This can be explained by 
the increased amount of training data used in our Table 5 
experiment. 


Second, we conclude that all three variants of SVM (S3VM,
SVM+, S3VM+) achieve higher overall test accuracy than
standard SVM. Using SVM as a baseline measure:


• S3VM improved slightly on the accuracies of both positive
and negative classes, which suggests that using
auxiliary data has a positive impact on identifying
both the good and bad students. This is consistent
with the fact that the auxiliary data contain both declined
(i.e., admitted but not enrolled) and rejected
(i.e., not admitted) applicants, which could improve
the accuracy of positive and negative classes respectively.


• SVM+ demonstrated improvement on the positive side
only, which indicates that the partition of bias groups
by academic year is most effective in identifying the
top students. One explanation for this could be that
the top 20% of students are inherently different from
year to year, while the bottom 20% of students remain
similar. Another is that a particular admissions committee has
biases about how to recognize a strong student.


• Our model S3VM+ has a noticeable advantage among
all models in predicting the positive class: 82% versus
73% (SVM), 75% (S3VM) and 77% (SVM+). In light
of the construction of S3VM+, one could conclude that
adding auxiliary data to each partition group further
enhances the power of identifying top students. On the
other hand, because grouping does not have a significant
impact on identifying bottom students (as demonstrated
by SVM+), S3VM+ would only result in a limited
gain for the negative class.


Lastly, from the training accuracies presented in Table 5, we
observe a significantly better fit of the training data using
our model S3VM+. In particular, S3VM+ reaches 95% versus 92% (SVM+),
81% (S3VM) and 79% (SVM) for the positive class,
and 89% versus 84% (SVM+), 82% (S3VM) and 80% (SVM)
for the negative class. Compared to the standard
SVM, S3VM improved training accuracies evenly on both
classes, and SVM+ and S3VM+ demonstrated more significant
gains on the positive class, which is consistent with
what we observed in the test data.


4.4 Labeling Strategy: Relative vs. Absolute
Recall that in Section 4.1, we discussed our choice of label- 
ing the top 20% and bottom 20% of students with respect 
to their MS-GPAs as our two classes. We explained our ra- 
tionale of using relative rather than hard cutoffs to label our 
data. We confirm this conjecture in Table 6, where we show
the results of an experiment using MS-GPA > 3.8 for the
top students and MS-GPA < 3.4 for the bottom students. In the table
we see that for all four methods, the overall accuracies are
lower than in Table 5.


4.5 Analysis on Weight Vectors 

Because we utilized a linear SVM and its variants, we found 
it interesting to investigate the ranking and magnitude of 
each individual feature in the weight vectors produced by 
each model. Table 7 presents the ranking of the $w_i$’s in the
weight vectors (w’s) of four models. Figure 1 displays the
weights of individual features across four models using their
magnitudes. In order to make a meaningful comparison,
each weight vector $w = \{w_1, w_2, \ldots, w_{12}\}$ is scaled by the
maximum absolute value of its components. Thus, the weight
for the most important feature is either 1 or -1. Note that,
for SVM+ and S3VM+, we display the shared hyper-plane
vector w without the correcting functions for each group.
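
The scaling step can be sketched as follows. This is an illustrative helper, not our analysis code: the function name `scale_and_rank` is hypothetical, and ranking features by the absolute value of their scaled weights is an assumption about how the ranking is derived.

```python
def scale_and_rank(weights):
    """Scale weights by their largest absolute value and rank the features.

    Returns (scaled, ranking), where ranking lists 1-based feature numbers
    from the largest to the smallest absolute weight (an assumed ordering).
    """
    max_abs = max(abs(wi) for wi in weights)
    if max_abs == 0:
        raise ValueError("weight vector is all zeros")
    scaled = [wi / max_abs for wi in weights]
    ranking = sorted(
        range(1, len(weights) + 1),
        key=lambda i: abs(scaled[i - 1]),
        reverse=True,
    )
    return scaled, ranking
```

After scaling, the top-ranked feature always has weight 1 or -1, which makes the weight vectors of different models directly comparable.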


From Table 7, we observe that all models except standard
SVM suggest the same top two features: “GRE Quantitative”
and “GRE Analytical Writing” scores. Furthermore,
SVM+ and S3VM+ overlapped in their top five features
but with a different ranking order.


Table 7: Weights Ranking Comparison

Model | R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 R11 R12
SVM   | W1 Wil W4 W12 W3 Wo W7 W2 We We Ws
S3VM  | W4 W3 Wil Wil WE W2 WT W1i0 WO W12 WS
SVM+  | w3 w4 Wo Wo Wit Wi W7 W5, Ws W12 WE
S3VM+ | w3 wa wil w2 Wo wW7 Wl Wi2 W5 WE wWi0


Figure 1: Weights Distribution Over 12 Features Across
Four Models
The weights are normalized using $w \leftarrow w / \max_i |w_i|$.


From Figure 1, we conclude that the most important fea- 
tures are 1 (“Undergraduate GPA”), 2 (“GRE Verbal”), 3 
(“GRE Quantitative”), 4 (“GRE Analytical Writing”) and 
11 (“Bachelor’s Degree in EECS (Yes/No)”). A closer ex- 
amination reveals that SVM relies mostly on three features 
(1, 4, and 11). S3VM has significantly large weights on two 
additional features, 6 (“TOEFL Reading”) and 7 (“TOEFL 
Listening”), on top of the five features listed above. SVM+ 
and S3VM+ made use of one additional feature, 9
(“TOEFL Writing”).


5. CONCLUSIONS 


In this paper, we applied a quantitative machine learning 
approach to predict candidates’ potential academic performance
based on information from their applications. We
built our model using previously admitted students with
their cumulative GPAs as performance measures and tested
our model’s efficacy for the incoming students. Through- 
out our experiments, we found a unique challenge associ- 
ated with our task, which is different data distributions 
across the academic years due to biases arising from chang- 
ing membership of the admissions committee. We addressed 
this issue with the Learning Using Privileged Information 
(LUPI) framework. We further handled the limited train- 
ing data issue by employing a semi-supervised version of 
SVM to utilize the large amount of unlabeled data (i.e., 
the rejected/declined applications). Our resulting model,
S3VM+, is a novel variant of SVM that addresses subjectivity
and lack of labeled data simultaneously. Our experimental
results demonstrate a significant gain of our model compared
to three existing models in the literature (i.e.,
standard SVM, S3VM, and SVM+). Although we based
our work on a two-year master’s program, our model is eas- 
ily extensible to similar tasks such as college or pre-school 
admissions. Our model can also be applied to other real 
world situations in which data may have clearly defined bi- 
ased subgroups and a large amount of unlabeled data. 